Dataset: Medical Information Mart for Intensive Care-III (MIMIC-III)
On the development and validation of large language model-based classifiers for identifying social determinants of health
NLP Tasks: Text Classification, Information Extraction
Method: LLM-based classifiers built on Bidirectional Encoder Representations from Transformers (BERT) and a Robustly Optimized BERT Pretraining Approach (RoBERTa); see the sketch after the metrics
Metrics:
- Area under the receiver operating characteristic curve (AUROC) for homelessness (0.78)
- AUROC for food insecurity (0.72)
- AUROC for domestic violence (0.83)
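A minimal sketch of this setup, assuming one binary RoBERTa classifier per SDoH factor and AUROC evaluation; "roberta-base" is an untuned stand-in for the paper's fine-tuned checkpoint, and the example notes and labels are invented.

```python
# Sketch: one binary classifier per SDoH factor, scored by AUROC.
# "roberta-base" stands in for the paper's fine-tuned model, so its
# classification head is randomly initialized; outputs are illustrative.
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)  # factor present / absent
model.eval()

def score(note: str) -> float:
    """Probability that the SDoH factor (e.g., homelessness) is documented."""
    inputs = tokenizer(note, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        return torch.softmax(model(**inputs).logits, dim=-1)[0, 1].item()

notes = ["Currently undomiciled, staying in shelters.",
         "Lives at home with spouse, no housing concerns."]
labels = [1, 0]  # invented gold labels
print("AUROC:", roc_auc_score(labels, [score(n) for n in notes]))
```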
Extraction of Substance Use Information From Clinical Notes: Generative Pretrained Transformer-Based Investigation
NLP Tasks: Text Classification, Information Extraction, Question Answering, Text Generation
Method: a generative pretrained transformer (GPT) model, specifically GPT-3.5 (see the sketch after the metrics)
Metrics:
- Accuracy (high in the zero-shot setting)
- Recall (improved in the few-shot setting)
- F1-score (improved in the few-shot setting)
- Precision (lower in the few-shot setting)
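A minimal sketch of the zero- vs. few-shot comparison using the OpenAI chat completions API; the prompt wording, the few-shot demonstration, and the "gpt-3.5-turbo" model alias are assumptions, not the paper's exact setup.

```python
# Sketch: zero- vs. few-shot substance-use extraction via the OpenAI
# chat completions API. Prompts and the demonstration are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

FEW_SHOT = [  # one invented demonstration; the paper's exemplars differ
    {"role": "user", "content": "Note: Denies tobacco; drinks socially."},
    {"role": "assistant",
     "content": "tobacco: no; alcohol: yes; drugs: not mentioned"},
]

def extract_substance_use(note: str, few_shot: bool = False) -> str:
    messages = [{"role": "system",
                 "content": "Extract tobacco, alcohol, and drug use status "
                            "from the clinical note."}]
    if few_shot:
        messages.extend(FEW_SHOT)
    messages.append({"role": "user", "content": f"Note: {note}"})
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0)
    return resp.choices[0].message.content

print(extract_substance_use("Smokes 1 ppd x 20 years; no EtOH.",
                            few_shot=True))
```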
The potential and pitfalls of using a large language model such as ChatGPT, GPT-4, or LLaMA as a clinical assistant
NLP Tasks: Text Classification, Information Extraction, Question Answering
Method: Evaluation of ChatGPT, GPT-4, and LLaMA in identifying patients with specific diseases using gold-labeled Electronic Health Records (EHRs) from the MIMIC-III database.
Metrics:
- F1-score (≥85% for COPD, CKD, and PBC)
- F1-score (4.23% higher for PBC than traditional machine learning models)
- Precision
- Specificity
- Sensitivity
- Negative Predictive Value
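All of the metrics listed above can be derived from a binary confusion matrix; below is a short scikit-learn sketch on dummy labels and predictions (not the paper's MIMIC-III results).

```python
# Sketch: the listed metrics computed from a binary confusion matrix.
# y_true/y_pred are dummy values for illustration only.
from sklearn.metrics import confusion_matrix, f1_score, precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # gold label: disease present?
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # LLM prediction

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # a.k.a. recall
specificity = tn / (tn + fp)
npv = tn / (tn + fn)          # negative predictive value

print(f"F1={f1_score(y_true, y_pred):.2f} "
      f"precision={precision_score(y_true, y_pred):.2f} "
      f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"NPV={npv:.2f}")
```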
ARDSFlag: an NLP/machine learning algorithm to visualize and detect high-probability ARDS admissions independent of provider recognition and billing codes
NLP Tasks: Text Classification, Information Extraction
Method: the ARDSFlag algorithm, combining machine learning (ML) and natural language processing (NLP) techniques (a simplified component is sketched after the metrics)
Metrics:
- Accuracy (bilateral infiltrates: 91.9% ± 0.5%; heart failure/fluid overload in radiology reports: 86.1% ± 0.5%; echocardiogram notes: 98.4% ± 0.3%)
- Overall accuracy (89.0%)
- Specificity (91.7%)
- Recall (80.3%)
- Precision (75.0%)
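An illustrative sketch of one ARDSFlag-style component, flagging bilateral infiltrates in radiology report text with simple patterns and negation handling; the regexes are assumptions, and the published algorithm is considerably more elaborate.

```python
# Sketch: rule-based flag for bilateral infiltrates in a radiology report.
# The patterns and negation handling are simplified assumptions.
import re

INFILTRATE = re.compile(
    r"\b(bilateral|diffuse)\b.{0,40}\b(infiltrat|opacit|consolidat)", re.I)
NEGATION = re.compile(
    r"\b(no|without|clear of)\b.{0,30}\b(infiltrat|opacit)", re.I)

def flags_bilateral_infiltrates(report: str) -> bool:
    return bool(INFILTRATE.search(report)) and not NEGATION.search(report)

print(flags_bilateral_infiltrates(
    "CXR shows diffuse bilateral airspace opacities."))            # True
print(flags_bilateral_infiltrates("Lungs clear of infiltrates."))  # False
```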
Redefining Health Care Data Interoperability: Empirical Exploration of Large Language Models in Information Exchange
NLP Tasks: Information Extraction, Text Generation
Method: a text-based information-exchange approach facilitated by the LLM ChatGPT (see the sketch after the metrics)
Metrics:
- Accuracy (over 99%)
- Accuracy (NAME: 10.2%, NAME+SYN: 36.1% with typos, NAME+SYN: 61.8% with typo-specific fine-tuning)
- Accuracy (NAME: 11.2%, NAME+SYN: 92.7% for unseen synonyms)
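A minimal sketch of LLM-mediated exchange, asking a chat model to restructure free-text medication data into named fields; the target schema (NAME/DOSE/FREQUENCY) and the prompt are illustrative assumptions, not the paper's protocol.

```python
# Sketch: asking a chat model to restructure free-text medication data
# into named fields for exchange. Schema and prompt are assumptions.
from openai import OpenAI

client = OpenAI()

record = "pt on metformn 500mg BID and lisinopril 10 mg daily"  # typo kept
prompt = ("Rewrite the medication list below as JSON, one object per drug "
          "with keys NAME, DOSE, and FREQUENCY; correct obvious typos.\n"
          + record)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0)
print(resp.choices[0].message.content)  # downstream code would validate JSON
```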
Constructing synthetic datasets with generative artificial intelligence to train large language models to classify acute renal failure from clinical notes
NLP Tasks: Text Classification, Information Extraction, Question Answering
Method: a classifier that uses language models, trained on synthetic notes, to identify acute renal failure (see the sketch after the metrics)
Metrics:
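A minimal sketch of the synthetic-data idea, prompting a generative model for labeled synthetic notes that can then feed classifier training; the prompt wording and sampling settings are assumptions.

```python
# Sketch: generating labeled synthetic notes to train a classifier.
# The prompt wording and sampling settings are assumptions.
from openai import OpenAI

client = OpenAI()

def synth_note(has_arf: bool) -> str:
    condition = "with" if has_arf else "without"
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Write a short, fully synthetic ICU progress "
                              f"note for a patient {condition} acute renal "
                              "failure."}],
        temperature=0.9)  # higher temperature for training-set diversity
    return resp.choices[0].message.content

# A tiny labeled set; a real pipeline would generate thousands of notes
# and feed them to a standard text-classification fine-tuning loop.
train_set = [(synth_note(flag), int(flag))
             for flag in (True, False) for _ in range(2)]
print(len(train_set), "synthetic training examples")
```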
Exposing Vulnerabilities in Clinical LLMs Through Data Poisoning Attacks: Case Study in Breast Cancer
NLP Tasks: Text Classification, Information Extraction, Question Answering
Method: data poisoning attacks against clinical LLMs, with breast cancer as the case study (an illustrative label-flipping sketch follows the metrics)
Metrics:
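An illustrative sketch of one common poisoning technique, label flipping on a text-classification training set; the flip rate and target class are assumptions, and the paper's attack construction is not reproduced here.

```python
# Sketch: label flipping, one common data-poisoning technique. The flip
# rate and target class are assumptions, not the paper's attack.
import random

def poison_labels(dataset, target_label=1, flip_rate=0.05, seed=0):
    """Flip a small fraction of target-class labels in the training set."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in dataset:
        if label == target_label and rng.random() < flip_rate:
            label = 1 - label  # the model will now learn from a bad label
        poisoned.append((text, label))
    return poisoned

clean = [("mass is malignant", 1), ("benign findings", 0)] * 50
print("positives before:", sum(lbl for _, lbl in clean),
      "after:", sum(lbl for _, lbl in poison_labels(clean)))
```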
A Large Language Model Screening Tool to Target Patients for Best Practice Alerts: Development and Validation
NLP Tasks: Text Classification
Method: an AI screening tool using the BioMed-RoBERTa model (see the sketch after the metrics)
Metrics:
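A minimal sketch of a screening classifier built on BioMed-RoBERTa; the public allenai/biomed_roberta_base checkpoint stands in for the paper's fine-tuned model, and the alert threshold is an assumption.

```python
# Sketch: screening notes with BioMed-RoBERTa to decide whether to fire a
# best-practice alert. The base checkpoint stands in for the paper's
# fine-tuned model (its classification head here is randomly initialized).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/biomed_roberta_base")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/biomed_roberta_base", num_labels=2)
model.eval()

def should_alert(note: str, threshold: float = 0.5) -> bool:
    inputs = tokenizer(note, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return probs[0, 1].item() >= threshold  # fire the alert?

print(should_alert("History of chronic hepatitis B, not on treatment."))
```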
Evaluation and mitigation of the limitations of large language models in clinical decision-making
NLP Tasks: Information Extraction, Text Classification, Question Answering
Method: a framework that simulates a realistic clinical setting using a curated dataset derived from the Medical Information Mart for Intensive Care (MIMIC) database (see the sketch after the metrics)
Metrics:
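A minimal sketch of the simulated-clinic loop, revealing curated record items stepwise and letting the model either commit to a diagnosis or request more data; the record contents, prompt, and stopping rule are illustrative assumptions.

```python
# Sketch: a simulated clinical setting where record items are revealed
# stepwise and the model either diagnoses or asks for more data.
from openai import OpenAI

client = OpenAI()
record = {"history": "RLQ abdominal pain for 12 hours, nausea.",
          "labs": "WBC 14.2, CRP elevated.",
          "imaging": "US: noncompressible, dilated appendix."}

context, answer = [], "MORE"
for key in ("history", "labs", "imaging"):  # reveal one item per turn
    context.append(f"{key}: {record[key]}")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", temperature=0,
        messages=[{"role": "user",
                   "content": "\n".join(context) + "\nGive a working "
                              "diagnosis, or reply only MORE if you need "
                              "more information."}])
    answer = resp.choices[0].message.content.strip()
    if answer.upper() != "MORE":
        break
print(answer)
```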
Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models
NLP Tasks: Information Extraction, Text Classification, Question Answering, Text Generation
Method: evaluation of three popular large language models (LLMs): Bard, ChatGPT-3.5, and GPT-4, using various prompt strategies and a majority-voting strategy (sketched after the metrics)
Metrics:
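A minimal sketch of the majority-voting strategy, pooling candidate diagnoses and keeping the most frequent answer; the votes shown are dummy strings, not outputs from Bard, ChatGPT-3.5, or GPT-4.

```python
# Sketch: majority voting over candidate diagnoses. The votes are dummy
# strings, not model outputs.
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]

votes = ["Fabry disease", "fabry disease", "Gaucher disease"]
print(majority_vote(votes))  # -> "fabry disease"
```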
The shaky foundations of large language models and foundation models for electronic health records
NLP Tasks: Information Extraction, Text Classification, Question Answering
Method: a narrative review and a taxonomy of foundation models trained on non-imaging EMR data
Metrics:
- Not applicable (narrative review)